
    Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures

    Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but they are architecture- and application-dependent. We propose a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized Distributed Affinity Dual Approximation algorithm (DADA), which groups tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with six CPU cores and eight NVIDIA Fermi GPUs, using three standard dense linear algebra kernels from the PLASMA library ported on top of the XKaapi runtime. The results show that both HEFT and DADA perform well under various experimental conditions, but that DADA performs better for larger problem sizes and numbers of GPUs and, in most cases, generates far less data transfer than HEFT to achieve the same performance.
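
    A minimal C++ sketch of the earliest-finish-time rule described above: each task is placed on the CPU or GPU worker that minimizes its estimated finish time, with a transfer penalty added for GPU placement (the cost that DADA reduces by grouping tasks by affinity). The worker and task names and the cost model are illustrative assumptions; this is not the XKaapi or DADA implementation.

        // heft_sketch.cpp -- illustrative only; not the XKaapi/DADA scheduler.
        #include <cstddef>
        #include <cstdio>
        #include <limits>
        #include <string>
        #include <vector>

        struct Worker {
            std::string name;
            double ready_time;   // when this worker becomes free
        };

        struct Task {
            std::string name;
            double cpu_cost, gpu_cost;   // estimated execution times (assumed known)
            double transfer_cost;        // extra data movement if run on a GPU
        };

        // HEFT rule: pick the worker that minimizes the task's earliest finish time.
        std::size_t pick_worker(const std::vector<Worker>& workers, const Task& t) {
            std::size_t best = 0;
            double best_finish = std::numeric_limits<double>::max();
            for (std::size_t i = 0; i < workers.size(); ++i) {
                const bool is_gpu = workers[i].name.rfind("gpu", 0) == 0;
                const double exec = is_gpu ? t.gpu_cost + t.transfer_cost : t.cpu_cost;
                const double finish = workers[i].ready_time + exec;
                if (finish < best_finish) { best_finish = finish; best = i; }
            }
            return best;
        }

        int main() {
            std::vector<Worker> workers = {{"cpu0", 0}, {"cpu1", 0}, {"gpu0", 0}};
            std::vector<Task> tasks = {{"potrf", 4.0, 1.0, 2.0},
                                       {"trsm", 3.0, 0.5, 2.0},
                                       {"gemm", 6.0, 0.5, 2.0}};
            for (const Task& t : tasks) {
                const std::size_t i = pick_worker(workers, t);
                const bool is_gpu = workers[i].name.rfind("gpu", 0) == 0;
                workers[i].ready_time += is_gpu ? t.gpu_cost + t.transfer_cost : t.cpu_cost;
                std::printf("%s -> %s (finish %.1f)\n", t.name.c_str(),
                            workers[i].name.c_str(), workers[i].ready_time);
            }
        }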

    Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

    Asynchronous programming models (APMs) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs, as correctly and efficiently handling MPI communication in different models is still a challenge. Meanwhile, new low-level implementations of lightweight, cooperatively scheduled execution contexts (fibers, also known as user-level threads, ULTs) are meant to serve as a basis for higher-level APMs, and their integration into MPI implementations has been proposed as a replacement for traditional POSIX thread support to alleviate these challenges. In this paper, we first establish a taxonomy in an attempt to clearly distinguish different concepts in the parallel software stack. We argue that the proposed tight integration of fiber implementations with MPI is neither warranted nor beneficial, and instead is detrimental to the goal of MPI being a portable communication abstraction. We propose MPI Continuations as an extension to the MPI standard that provides callback-based notifications on completed operations, leading to a clear separation of concerns through a loose coupling mechanism between MPI and APMs. We show that this interface is flexible and interacts well with different APMs, namely OpenMP detached tasks, OmpSs-2, and Argobots.
    Comment: 12 pages, 7 figures. Published in the proceedings of EuroMPI/USA '20, September 21-24, 2020, Austin, TX, USA.
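
    To make the callback-on-completion idea concrete, the sketch below is a user-level emulation built only from standard MPI calls (MPI_Isend, MPI_Irecv, MPI_Test): a continuation is registered for a nonblocking request and fired from a progress loop that a task runtime could invoke between tasks. The Continuation struct and the attach/progress helpers are hypothetical; the paper's actual proposal is an MPI standard extension and is not reproduced here.

        // continuation_sketch.cpp -- user-level emulation of completion callbacks;
        // not the MPI Continuations interface proposed in the paper.
        #include <mpi.h>
        #include <cstddef>
        #include <cstdio>
        #include <functional>
        #include <vector>

        struct Continuation {
            MPI_Request request;
            std::function<void()> callback;   // runs once the operation completes
        };

        static std::vector<Continuation> pending;

        // Register a callback to run when the nonblocking operation completes.
        void attach(MPI_Request req, std::function<void()> cb) {
            pending.push_back({req, std::move(cb)});
        }

        // Poll outstanding requests and fire callbacks for completed ones.
        void progress() {
            for (std::size_t i = 0; i < pending.size();) {
                int done = 0;
                MPI_Test(&pending[i].request, &done, MPI_STATUS_IGNORE);
                if (done) {
                    pending[i].callback();
                    pending.erase(pending.begin() + i);
                } else {
                    ++i;
                }
            }
        }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            int value = rank;
            MPI_Request req;
            if (rank == 0) {
                MPI_Irecv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
                attach(req, [&] { std::printf("received %d; resume the waiting task here\n", value); });
            } else if (rank == 1) {
                MPI_Isend(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
                attach(req, [] { std::printf("send completed\n"); });
            }
            while (!pending.empty()) progress();   // an APM runtime would call this between tasks
            MPI_Finalize();
        }

    Run with two ranks (for example, mpirun -np 2) to see both callbacks fire.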

    Correlated Set Coordination in Fault Tolerant Message Logging Protocols

    Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
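
    A short sketch of the decision the protocol hinges on, assuming that ranks on the same node form one correlated set: ordering events are always logged so that message delivery can be replayed deterministically, but payload logging is skipped when sender and receiver belong to the same set. The fixed ranks-per-node mapping and the function names are illustrative assumptions, not the paper's implementation.

        // correlated_logging_sketch.cpp -- illustrative decision logic only.
        #include <cstdio>

        // Ranks on the same node fail together, so they form one correlated set.
        // The set id is derived here from an assumed fixed number of ranks per node.
        constexpr int kRanksPerNode = 8;   // hypothetical mapping, for illustration

        int correlated_set(int rank) { return rank / kRanksPerNode; }

        // Payload logging is only needed when sender and receiver can fail independently.
        bool must_log_payload(int sender, int receiver) {
            return correlated_set(sender) != correlated_set(receiver);
        }

        int main() {
            std::printf("0 -> 3: log payload? %d (same correlated set)\n", must_log_payload(0, 3));
            std::printf("0 -> 9: log payload? %d (different sets)\n", must_log_payload(0, 9));
        }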

    Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

    Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then, a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution using the data of the non-failed processors; the iterative method is then restarted with this new vector. The main advantage of the lossy approach over standard checkpoint algorithms is that it does not increase the computational cost of the iterative solver when no failure occurs. Experiments comparing the different techniques are presented, using the fault-tolerant FT-MPI library. Both iterative linear solvers and eigensolvers are considered.
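
    One common way to write the lossy recovery step for a block row-partitioned system Ax = b (the paper's exact formulation may differ): if processor p fails and its block x_p of the current iterate is lost, a replacement block is obtained from the surviving blocks by solving the local system

        A_{pp} \tilde{x}_p = b_p - \sum_{q \neq p} A_{pq} x_q ,

    and the iterative method is restarted from the vector whose p-th block is \tilde{x}_p, with the other blocks left unchanged.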

    A Current Task-Based Programming Paradigms Analysis

    Task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with defined inputs and outputs, along with the dependencies between them. The runtime can then schedule the tasks and data migrations efficiently over all the available cores while reducing the waiting time between tasks. This paper focuses on comparing several task-based programming models with one another, using the LU factorization as a benchmark. HPX, PaRSEC, Legion and YML+XMP are task-based programming models which schedule data movement and computational tasks on distributed resources allocated to the application. YML+XMP supports parallel and distributed tasks with XcalableMP, a PGAS language. Their performance and scalability are compared to ScaLAPACK, a highly optimized library which uses MPI to perform communications between the processes, on up to 64 nodes. We performed a block-based LU factorization with the task-based programming models on matrices of size up to 49512 × 49512. HPX performs better than PaRSEC, Legion and YML+XMP, but not better than ScaLAPACK. YML+XMP has better scalability than HPX, Legion and PaRSEC. Regent has trouble scaling from 32 nodes to 64 nodes with our algorithm.
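
    A scalar-granularity C++ sketch of the block LU structure that these runtimes turn into a task graph: the bodies below correspond to the GETRF, TRSM and GEMM tile kernels, and in a task-based model each kernel invocation becomes a task whose input and output tiles define the dependencies. The sketch runs sequentially, uses no pivoting, and is not taken from any of the compared frameworks.

        // lu_task_sketch.cpp -- sequential sketch of the task structure of a
        // right-looking LU factorization without pivoting.
        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 4;
            // Diagonally dominant so that LU without pivoting is stable.
            std::vector<std::vector<double>> A = {
                {10, 2, 3, 1},
                { 4, 12, 1, 2},
                { 2, 1, 9, 3},
                { 1, 3, 2, 11},
            };
            for (int k = 0; k < n; ++k) {
                // GETRF(k): factor the diagonal block (trivial at scalar granularity).
                for (int i = k + 1; i < n; ++i)
                    A[i][k] /= A[k][k];                   // TRSM task: depends on GETRF(k)
                for (int i = k + 1; i < n; ++i)
                    for (int j = k + 1; j < n; ++j)
                        A[i][j] -= A[i][k] * A[k][j];     // GEMM task: depends on the TRSMs
            }
            // A now holds L (unit lower) and U (upper) packed in place.
            for (int i = 0; i < n; ++i) {
                for (int j = 0; j < n; ++j) std::printf("%7.3f ", A[i][j]);
                std::printf("\n");
            }
        }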

    Analysis of the Component Architecture Overhead

    Component architectures provide a useful framework for developing an extensible and maintainable code base upon which large-scale software projects can be built. Component methodologies have only recently been incorporated into applications by the High Performance Computing community, in part because of the perception that component architectures necessarily incur an unacceptable performance penalty. The Open MPI project is creating a new implementation of the Message Passing Interface standard based on a custom component architecture, the Modular Component Architecture (MCA), to enable straightforward customization of a high-performance MPI implementation. This paper reports on a detailed analysis of the performance overhead in Open MPI introduced by the MCA. We compare the MCA-based implementation of Open MPI with a modified version that bypasses the component infrastructure. The overhead of the MCA is shown to be low, on the order of 1%, for both latency and bandwidth microbenchmarks as well as for the NAS Parallel Benchmark suite.
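
    A small C++ illustration of the kind of indirection whose cost the paper measures: a component exports its entry points through a table of function pointers selected at run time, compared with the direct call a monolithic build would make. The btl_module name and the send signature are loosely MCA-flavoured but hypothetical; they are not the Open MPI interfaces.

        // component_call_sketch.cpp -- illustrates component-style indirection;
        // not the actual Open MPI MCA framework.
        #include <cstdio>

        // A component exports its entry points through a function-pointer table,
        // so the core library can select an implementation at run time.
        struct btl_module {               // hypothetical, loosely MCA-flavoured
            int (*send)(const void* buf, int len);
        };

        static int tcp_send(const void* /*buf*/, int len) {
            std::printf("tcp component sending %d bytes\n", len);
            return 0;
        }

        // Direct call: what a build that bypasses the component layer would do.
        static int direct_send(const void* buf, int len) { return tcp_send(buf, len); }

        int main() {
            btl_module selected = {&tcp_send};   // chosen by the framework at startup
            char payload[64] = {0};
            // The measured overhead is essentially this extra pointer indirection
            // (plus parameter lookup at selection time), shown here side by side.
            selected.send(payload, static_cast<int>(sizeof payload));
            direct_send(payload, static_cast<int>(sizeof payload));
        }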